{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e04c382a-7c49-452b-b9bf-e448951c64fe",
   "metadata": {},
   "source": [
    "# Building a Regex Data Labeler w/ your own Regex"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fb3ecb9-bc51-4c18-93d5-7991bbee5165",
   "metadata": {},
   "source": [
    "This notebook teaches how to use the existing / create your own regex labeler as well as utilize it for structured data profiling.\n",
    "\n",
    "1. Loading and utilizing the pre-existing regex data labeler\n",
    "1. Replacing the existing regex rules with your own.\n",
    "1. Utilizng a regex data labeler inside of the structured profiler\n",
    "\n",
    "First, let's import the libraries needed for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a67c197b-d3ee-4896-a96f-cc3d043601d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import json\n",
    "from pprint import pprint\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "try:\n",
    "    import dataprofiler as dp\n",
    "except ImportError:\n",
    "    sys.path.insert(0, '../..')\n",
    "    import dataprofiler as dp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c71356f4-9020-4862-a1e1-816effbb5443",
   "metadata": {},
   "source": [
    "## Loading and using the pre-existing regex data labeler\n",
    "We can easily import the exsting regex labeler via the `load_from_library` command from the `dp.DataLabeler`. This allows us to import models other than the default structured / unstructured labelers which exist in the library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "113d6655-4bca-4d8e-9e6f-b972e29d5684",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler = dp.DataLabeler.load_from_library('regex_model')\n",
    "data_labeler.model.help()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b405887-2b92-44ca-b8d7-29c384f6dd9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "pprint(data_labeler.label_mapping)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11916a48-098c-4056-ac6c-b9542d85fa86",
   "metadata": {},
   "outputs": [],
   "source": [
    "pprint(data_labeler.model._parameters['regex_patterns'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da0e97ee-8d6d-4631-9b55-78ed904d5f41",
   "metadata": {},
   "source": [
    "### Predicting with the regex labeler\n",
    "In the prediction below, the default settings will `split` the predictions by default as it's aggregation function. In other words, if a string '123 Fake St.' The first character would receive a vote for integer and for address giving both a 50% probability. This is because these regex functions are defined individually and a post prediction aggregation function must be used to get the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe519e65-36a7-4f42-8314-5369de8635c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluate a prediction using the default parameters\n",
    "data_labeler.predict(['123 Fake St.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b41d834d-e47b-45a6-8970-d2d2033e2ade",
   "metadata": {},
   "source": [
    "## Replacing the regex rules in the existing labeler\n",
    "\n",
    "We can achieve this by:\n",
    "1. Setting the label mapping to the new labels\n",
    "2. Setting the model parameters which include: `regex_patterns`, `default_label`, `ignore_case`, and `encapsulators`\n",
    "\n",
    "where `regex_patterns` is a `dict` of lists or regex for each label, `default_label` is the expected default label for the regex, `ignore_case` tells the model to ignore case during its detection, and `encapsulators` are generic regex statements placed before (start) and after (end) each regex. Currently, this is used by the default model to capture labels that are within a cell rather than matching the entire cell. (e.g. ' 123 ' will still capture 123 as digits)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6bb010a-406f-4fd8-abd0-3355a5ad0ded",
   "metadata": {},
   "source": [
    "Below, we created 4 labels where `other` is the `default_label`. Additionally, we set enabled case sensitivity such that upper and lower case letters would be detected separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f86584cf-a7af-4bae-bf44-d87caa68833a",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.set_labels({'other': 0, 'digits':1, 'lowercase_char': 2, 'uppercase_chars': 3})\n",
    "data_labeler.model.set_params(\n",
    "    regex_patterns={\n",
    "        'digits': [r'[+-]?[0-9]+'],\n",
    "        'lowercase_char': [r'[a-z]+'],\n",
    "        'uppercase_chars': [r'[A-Z]+'],\n",
    "    },\n",
    "    default_label='other',\n",
    "    ignore_case=False,\n",
    ")\n",
    "data_labeler.label_mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ece1c8c-18a5-46fc-b563-6458e6e71e53",
   "metadata": {},
   "source": [
    "### Predicting with the new regex labels\n",
    "\n",
    "Here we notice the otuput of the predictions gives us a prediction per character for each regex. Note how by default it is matching subtext due to the encapsulators. Where `123` were found to be digits, `FAKE` was foudn to be upper case, and the whitespaces and `St.` were other due no single regex being correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92842e14-2ea6-4879-b58c-c52b607dc94c",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.predict(['123 FAKE St.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ce14e54-094f-41ff-9ce0-69acace6abc2",
   "metadata": {},
   "source": [
    "Below we turn off case-sensitivity and we see how the aggregation funciton splits the votes for characters between the `lowercase` and `uppercase` chars."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7b8ed9d-c912-4dc7-82c5-ba78a3affc1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.model.set_params(ignore_case=True)\n",
    "data_labeler.predict(['123 FAKE St.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc66515f-24e4-40f0-8592-b1ee4fba7077",
   "metadata": {},
   "source": [
    "For the rest of this notebook, we will just use a single regex serach which will capture both upper and lower case chars."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e0c1b11-d111-4080-873f-40aff7cf7930",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.set_labels({'other': 0, 'digits':1, 'chars': 2})\n",
    "data_labeler.model.set_params(\n",
    "    regex_patterns={\n",
    "        'digits': [r'[=-]?[0-9]+'],\n",
    "        'chars': [r'[a-zA-Z]+'],\n",
    "    },\n",
    "    default_label='other',\n",
    "    ignore_case=False,\n",
    ")\n",
    "data_labeler.label_mapping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28e7b2ee-c661-4b31-b727-078f1393b5c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.predict(['123 FAKE St.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f60c8fd1-76e1-469f-9e5a-62d7529301b3",
   "metadata": {},
   "source": [
    "### Adjusting postprocessor properties\n",
    "\n",
    "Below we can look at the possible postprocessor parameters to adjust the aggregation function to the desired output. The previous outputs by default used the `split` aggregation function, however, below we will show the `random` aggregation function which will randomly choose a label if multiple labels have a vote for a given character."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36afa82b-1ca5-49ad-9aa9-84c6de621f59",
   "metadata": {},
   "source": [
    "data_labeler.postprocessor.help()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66840940-47bf-433a-8ee8-977f26926e0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.postprocessor.set_params(aggregation_func='random')\n",
    "data_labeler.predict(['123 FAKE St.'], predict_options=dict(show_confidences=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c32b74fc-5051-4d53-b02a-4d1e4a35958f",
   "metadata": {},
   "source": [
    "## Integrating the new Regex labeler into Structured Profiling\n",
    "\n",
    "While the labeler can be used alone, it is also possible to integrate the labeler into the StructuredProfiler with a slight change to its postprocessor. The StructuredProfiler requires a labeler which outputs othe confidence of each label for a given cell being processed. To convert the output of the `RegexPostProcessor` into said format, we will use the `StructRegexPostProcessor`. We can create the postprocessor and set the `data_labeler`'s postprocessor to this value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2663f2d-29a2-41ed-88dd-8a213d303365",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprofiler.labelers.data_processing import StructRegexPostProcessor\n",
    "\n",
    "postprocesor = StructRegexPostProcessor()\n",
    "data_labeler.set_postprocessor(postprocesor)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7352769-d636-42c6-9706-7d9cff520a72",
   "metadata": {},
   "source": [
    "Below we will see the output is now one vote per sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18814634-0fd0-4ce8-b0c3-9b9454701a43",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_labeler.predict(['123 FAKE St.', '123', 'FAKE'], predict_options=dict(show_confidences=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4aa4e36-7362-4966-b827-3f5a6f2dfa7c",
   "metadata": {},
   "source": [
    "### Setting the Structuredprofiler's DataLabeler\n",
    "\n",
    "We can create a `ProfilerOption` and set the structured options to have the new data_labeler as its value. We then run the StructuredProfiler with the specified options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f18cf7f-283e-4e54-b3f9-1312828c3029",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create and set the option for the regex data labeler to be used at profile time\n",
    "profile_options = dp.ProfilerOptions()\n",
    "profile_options.set({'structured_options.data_labeler.data_labeler_object': data_labeler})\n",
    "\n",
    "# profile the dataset using the suggested regex data labeler\n",
    "data = pd.DataFrame(\n",
    "    [['123 FAKE St.', 123, 'this'], \n",
    "     [123           ,  -9, 'IS'], \n",
    "     ['...'         , +80, 'A'], \n",
    "     ['123'         , 202, 'raNDom'], \n",
    "     ['test'        ,  -1, 'TEST']], \n",
    "    dtype=object)\n",
    "profiler = dp.Profiler(data, options=profile_options)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "663e49f7-358b-4b0f-99a4-1823908ef990",
   "metadata": {},
   "source": [
    "Below we see the first column is given 3 labels as it received multiple votes for said column. However, it was confident on the second and third column which is why it only specified `digits` and `chars` respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f796d7f5-7e8a-447b-9cbb-d5b8180660a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "pprint(profiler.report(\n",
    "    dict(output_format='compact', \n",
    "         omit_keys=['data_stats.*.statistics', \n",
    "                    'data_stats.*.categorical', \n",
    "                    'data_stats.*.order', \n",
    "                    'global_stats'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "261b903f-8f4c-403f-839b-ab8813f850e9",
   "metadata": {},
   "source": [
    "## Saving the Data Labeler for future use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6ffbaf2-9400-486a-ba83-5fc9ba9334d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.isdir('my_new_regex_labeler'):\n",
    "    os.mkdir('my_new_regex_labeler')\n",
    "data_labeler.save_to_disk('my_new_regex_labeler')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09e40cb6-9d89-41c4-ae28-3dca498f8c68",
   "metadata": {},
   "source": [
    "## Loading the saved Data Labeler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52615b25-70a6-4ebb-8a32-14aaf1e747d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "saved_labeler = dp.DataLabeler.load_from_disk('my_new_regex_labeler')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1ccc0b3-1dc2-4847-95c2-d6b8769b1590",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ensuring the parametesr are what we saved.\n",
    "print(\"label_mapping:\")\n",
    "pprint(saved_labeler.label_mapping)\n",
    "print(\"\\nmodel parameters:\")\n",
    "pprint(saved_labeler.model._parameters)\n",
    "print()\n",
    "print(\"postprocessor: \" + saved_labeler.postprocessor.__class__.__name__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c827f2ae-4af6-4f3f-9651-9ee9ebea9fa0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# predicting with the loaded labeler.\n",
    "saved_labeler.predict(['test', '123'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "606f9bbf-5955-4b7b-b0d1-390de5600f73",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
